Introduction
The Elite: Dangerous Database (EDDB) is a collection of data for the game Elite: Dangerous. The data is split into several different databases:
- Prices of goods (listings.csv)
- Stations (stations.json)
- Populated Systems (systems_populated.json)
- Factions (factions,json)
- Commodities (commodities.json)
- Modules (modules.json)
This notebook will be an introduction to each data set and provide some exploratory data analysis of the data sets. This EDA will also act as prototypes for the information I will want to display in the Shiny app.
library(tidyverse)
library(jsonlite)
library(plotly)
Prices
The listings.csv file gives a read out of the prices of goods at the time the file was updated.
listings <- read_csv("data/listings.csv", progress = FALSE)
listings
The columns for the listings are:
- id: The id of the particular good
- station_id: The station where the good is being sold
- commodity_id: The good being sold (related to commodities.json)
- supply: How much of the item is available for purchase
- supply_bracket: Unsure of purpose
- buy_price: Price the good can be bought at
- sell_price: Price the good can be sold at
- demand: How many items the station wants
- demand_bracket: Unsure of purpose
- collected_at: Time the data was collected
We can then look at a summary of the data to get an idea of what is in the data set.
summary(listings)
id station_id commodity_id supply
Min. : 1 Min. : 1 Min. : 1.00 Min. : 0
1st Qu.:1187777 1st Qu.:13470 1st Qu.: 22.00 1st Qu.: 0
Median :2338836 Median :27759 Median : 49.00 Median : 0
Mean :2615026 Mean :29000 Mean : 87.71 Mean : 27006
3rd Qu.:3698904 3rd Qu.:42522 3rd Qu.:103.00 3rd Qu.: 31
Max. :6900247 Max. :69881 Max. :330.00 Max. :23917110
supply_bracket buy_price sell_price demand
Min. :0.00 Min. : 0.0 Min. : 0 Min. : 0
1st Qu.:0.00 1st Qu.: 0.0 1st Qu.: 547 1st Qu.: 0
Median :0.00 Median : 0.0 Median : 1357 Median : 664
Mean :0.56 Mean : 535.2 Mean : 5247 Mean : 116685
3rd Qu.:1.00 3rd Qu.: 63.0 3rd Qu.: 3980 3rd Qu.: 12064
Max. :3.00 Max. :113355.0 Max. :257990 Max. :583601700
NA's :218269
demand_bracket collected_at
Min. :0.00 Min. :1.454e+09
1st Qu.:0.00 1st Qu.:1.523e+09
Median :2.00 Median :1.527e+09
Mean :1.81 Mean :1.523e+09
3rd Qu.:3.00 3rd Qu.:1.528e+09
Max. :3.00 Max. :1.529e+09
NA's :218269
From the summary, we can see that there are number of NA’s in the supply_bracket and demand_backet columns. Since we don’t know what those do, we can ignore those for the time being. Additionally, there are zeroes in the sell_price and buy_price columns. Those are essentially NA’s as a sell_price of 0 means that you can’t sell the product there and a buy_price of 0 means there are none available at that station.
listings %>%
filter(commodity_id == 1) %>%
select(buy_price, sell_price) %>%
gather(type, price) %>%
filter(price > 0) %>%
group_by(type) %>%
mutate(average = mean(price), median = median(price)) %>%
ggplot(aes(x = price, fill = type)) +
geom_histogram(bins = 100) +
facet_grid(type ~.) +
geom_vline(aes(xintercept = average)) +
geom_vline(aes(xintercept = median), linetype = "dotted")

This figure gives a quick histogram of the buy and sell prices of a single commodity across all available stations. Note that it filters out any buy or sell prices of 0. The solid line is the mean of the distribution and the dotted line is the median of the distribution. From the figure, an interesting topic to investigate is looking at comparative buy and sell prices across stations to create a system to search for the max difference.
listings %>%
filter(station_id == 12) %>%
select(buy_price, sell_price) %>%
gather(type, price) %>%
filter(price > 0) %>%
group_by(type) %>%
mutate(average = mean(price), median = median(price)) %>%
ggplot(aes(x = price, fill = type)) +
geom_histogram(bins = 20) +
facet_grid(type ~.) +
geom_vline(aes(xintercept = average)) +
geom_vline(aes(xintercept = median), linetype = "dotted")

This figure looks at a histogram of buy and sell prices for a specific station (across all available commodities). Again, the solid line is the mean and the dotted line is the median.
Questions
I want to put together some initial questions to answer using my Shiny app.
- What is the biggest price difference?
- What is the distance between the stations with the biggest price difference?
- Given a particular commodity, where can I find it and at what price?
- Given a particular commodity, where can I sell it and at what price?
- How do the buying and selling price compare the the galatic average (and reported average average and median)?
Stations
stations <- fromJSON("data/stations.json")
Error in fromJSON("data/stations.json") :
could not find function "fromJSON"
The stations data has 39 columns in it, the most relevant of which are the following:
- id: The station’s id
- name: The station’s name
- system_id: The system id where the station resides
- updated_at: Time at which the data was updated
- distance_to_star: Distance of the station from the system’s star
- has_ : A set of booleans showing what amenities the station offers
- is_planetary: A boolean showing whether the station is on a planet
- selling_ships: A list of the ships being sold
There are also a number of other useful variables showing additional information about the stations.
We can summarize the most relevant columns and get a sense of distributions and any possible NAs.
stations %>%
select(id, system_id, updated_at, distance_to_star, starts_with("has_"), is_planetary) %>%
summary()
id system_id updated_at distance_to_star
Min. : 1 Min. : 1 Min. :1.479e+09 Min. : 2
1st Qu.:18114 1st Qu.: 4843 1st Qu.:1.527e+09 1st Qu.: 200
Median :35262 Median : 10419 Median :1.528e+09 Median : 868
Mean :35128 Mean : 62959 Mean :1.527e+09 Mean : 17005
3rd Qu.:52285 3rd Qu.: 15469 3rd Qu.:1.529e+09 3rd Qu.: 2670
Max. :69883 Max. :17805395 Max. :1.529e+09 Max. :6783706
NA's :1951
has_blackmarket has_market has_refuel has_repair
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:46586 FALSE:14485 FALSE:7157 FALSE:13333
TRUE :21261 TRUE :53362 TRUE :60690 TRUE :54514
has_rearm has_outfitting has_shipyard has_docking
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:21408 FALSE:23268 FALSE:51848 FALSE:5548
TRUE :46439 TRUE :44579 TRUE :15999 TRUE :62299
has_commodities is_planetary
Mode :logical Mode :logical
FALSE:16943 FALSE:40952
TRUE :50904 TRUE :26895
We can also extract a list of possible ships that are sold and from that we can determine what stations sell which ships.
stations %>%
select(selling_ships) %>%
unnest() %>%
distinct()
stations %>%
select(id, name, system_id, selling_ships) %>%
unnest() %>%
filter(selling_ships == "Imperial Cutter") %>%
head()
For instance, we can get a listing of all of the stations that sell the “Imperial Cutter” and what system that station is in.
Questions
- What are the list of purchasable ships?
- What ship can be bought where?
- What module can be bought where?
Populated Systems
The populated systems data gives information on systems in the universe that are populated.
populated_systems <- as_tibble(fromJSON("data/systems_populated.json"))
head(populated_systems)
Like the stations data, the populated systems data has a large number of columns (29) that are mostly metadata about the systems. Most of the metadata is centered on government types, alleigance, security level, power play info, and factions. The information that will likely be useful for us are the following columns:
- id: The system’s id
- name: the system’s name
- x, y, z: The coordinates of the system in the universe
First, I want to look at the summary statistics for the coordinates to verify that there are no missing coordinates. It looks like all of the coordinates are present.
populated_systems %>%
select(x, y, z) %>%
summary()
x y z
Min. :-9557.94 Min. :-944.12 Min. :-6947.56
1st Qu.: -49.50 1st Qu.: -99.69 1st Qu.: -47.94
Median : 17.23 Median : -25.05 Median : 14.75
Mean : -20.58 Mean : -35.68 Mean : 84.76
3rd Qu.: 83.07 3rd Qu.: 34.25 3rd Qu.: 74.38
Max. : 2704.97 Max. : 366.66 Max. :19853.19
I can make a 3D scatter plot of the coordinates of the systems and show the color of the controlling power. All of this done with plotly and could be a fun way of showing where a particular system is in comparison to other systems. In this plot I’ve only included the top 4 controlling powers (based on number of systems controlled). Since I haven’t used plotly much, this could be a great opportunity to become more familiar with it.
p <- populated_systems %>%
filter(!is.na(power)) %>%
group_by(power) %>%
mutate(control_number = n()) %>%
ungroup() %>%
mutate(control_number = dense_rank(desc(control_number))) %>%
filter(control_number < 5) %>%
plot_ly(x = ~x, y = ~y, z = ~z, color = ~power) %>%
add_markers(opacity = 0.1)
p
Questions
- Where are the various systems?
- What powers or factions control each system?
Factions
The factions data set gives metadata on all of the factions involved in the game.
factions <- as_tibble(fromJSON("data/factions.json"))
head(factions)
I am unlikely to use this data set for this particular project, but it could provide additional information if I want to look at specific factions, particularly player created factions.
---
title: "Introduction and EDA for EDDB"
author: "Brian Richards"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

## Introduction

The Elite: Dangerous Database (EDDB) is a collection of data for the game 
Elite: Dangerous. The data is split into several different databases: 

* Prices of goods (listings.csv)
* Stations (stations.json)
* Populated Systems (systems_populated.json)
* Factions (factions,json)
* Commodities (commodities.json)
* Modules (modules.json)

This notebook will be an introduction to each data set and provide some 
exploratory data analysis of the data sets. This EDA will also act as 
prototypes for the information I will want to display in the Shiny app.

```{r, message = FALSE}
library(tidyverse)
library(jsonlite)
library(plotly)
```

## Prices

The `listings.csv` file gives a read out of the prices of goods at the time 
the file was updated.

```{r, message = FALSE}
listings <- read_csv("data/listings.csv", progress = FALSE)
```

```{r}
head(listings)
```

The columns for the `listings` are:

* id: The id of the particular good
* station_id: The station where the good is being sold
* commodity_id: The good being sold (related to commodities.json)
* supply: How much of the item is available for purchase
* supply_bracket: Unsure of purpose
* buy_price: Price the good can be bought at 
* sell_price: Price the good can be sold at
* demand: How many items the station wants
* demand_bracket: Unsure of purpose
* collected_at: Time the data was collected

We can then look at a summary of the data to get an idea of what is in the 
data set.

```{r}
summary(listings)
```

From the summary, we can see that there are number of NA's in the `supply_bracket` 
and `demand_backet` columns. Since we don't know what those do, we can ignore 
those for the time being. Additionally, there are zeroes in the `sell_price` 
and `buy_price` columns. Those are essentially NA's as a `sell_price` of 0 means 
that you can't sell the product there and a `buy_price` of 0 means there are 
none available at that station.

```{r}
listings %>% 
  filter(commodity_id == 1) %>% 
  select(buy_price, sell_price) %>% 
  gather(type, price) %>% 
  filter(price > 0) %>% 
  group_by(type) %>% 
  mutate(average = mean(price), median = median(price)) %>% 
  ggplot(aes(x = price, fill = type)) +
  geom_histogram(bins = 100) +
  facet_grid(type ~.) +
  geom_vline(aes(xintercept = average)) +
  geom_vline(aes(xintercept = median), linetype = "dotted")
```

This figure gives a quick histogram of the buy and sell prices of a single 
commodity across all available stations. Note that it filters out any buy or 
sell prices of 0. The solid line is the mean of the distribution and the dotted 
line is the median of the distribution. From the figure, an interesting topic 
to investigate is looking at comparative buy and sell prices across stations 
to create a system to search for the max difference.

```{r}
listings %>% 
  filter(station_id == 12) %>% 
  select(buy_price, sell_price) %>% 
  gather(type, price) %>% 
  filter(price > 0) %>% 
  group_by(type) %>% 
  mutate(average = mean(price), median = median(price)) %>% 
  ggplot(aes(x = price, fill = type)) +
  geom_histogram(bins = 20) +
  facet_grid(type ~.) +
  geom_vline(aes(xintercept = average)) +
  geom_vline(aes(xintercept = median), linetype = "dotted")
```

This figure looks at a histogram of buy and sell prices for a specific station 
(across all available commodities). Again, the solid line is the mean and the 
dotted line is the median.

### Questions 

I want to put together some initial questions to answer using my Shiny app. 

* What is the biggest price difference?
* What is the distance between the stations with the biggest price difference? 
* Given a particular commodity, where can I find it and at what price?
* Given a particular commodity, where can I sell it and at what price?
* How do the buying and selling price compare the the galatic average (and 
reported average average and median)?

## Stations 

```{r}
stations <- as_tibble(fromJSON("data/stations.json"))
```

```{r}
head(stations)
```

The stations data has 39 columns in it, the most relevant of which are the 
following:

* id: The station's id
* name: The station's name
* system_id: The system id where the station resides
* updated_at: Time at which the data was updated
* distance_to_star: Distance of the station from the system's star
* has_ : A set of booleans showing what amenities the station offers
* is_planetary: A boolean showing whether the station is on a planet
* selling_ships: A list of the ships being sold

There are also a number of other useful variables showing additional information 
about the stations.

We can summarize the most relevant columns and get a sense of distributions and 
any possible NAs.

```{r}
stations %>% 
  select(id, system_id, updated_at, distance_to_star, starts_with("has_"), is_planetary) %>% 
  summary()
```

We can also extract a list of possible ships that are sold and from that we 
can determine what stations sell which ships.

```{r}
stations %>% 
  select(selling_ships) %>% 
  unnest() %>% 
  distinct()
```

```{r}
stations %>% 
  select(id, name, system_id, selling_ships) %>% 
  unnest() %>% 
  filter(selling_ships == "Imperial Cutter") %>% 
  head()
```

For instance, we can get a listing of all of the stations that sell the 
"Imperial Cutter" and what system that station is in. 

### Questions 

* What are the list of purchasable ships?
* What ship can be bought where?
* What module can be bought where?

## Populated Systems

The populated systems data gives information on systems in the universe that 
are populated.

```{r}
populated_systems <- as_tibble(fromJSON("data/systems_populated.json"))
```

```{r}
head(populated_systems)
```

Like the stations data, the populated systems data has a large number of 
columns (29) that are mostly metadata about the systems. Most of the metadata 
is centered on government types, alleigance, security level, power play info, 
and factions. The information that will likely be useful for us are the 
following columns: 

* id: The system's id
* name: the system's name
* x, y, z: The coordinates of the system in the universe

First, I want to look at the summary statistics for the coordinates to verify 
that there are no missing coordinates. It looks like all of the coordinates 
are present.

```{r}
populated_systems %>% 
  select(x, y, z) %>% 
  summary()
```

I can make a 3D scatter plot of the coordinates of the systems and show the 
color of the controlling power. All of this done with plotly and could be a fun 
way of showing where a particular system is in comparison to other systems. In 
this plot I've only included the top 4 controlling powers (based on number of 
systems controlled). Since I haven't used plotly much, this could be a great 
opportunity to become more familiar with it.

```{r}
p <- populated_systems %>%
  filter(!is.na(power)) %>% 
  group_by(power) %>%
  mutate(control_number = n()) %>%
  ungroup() %>% 
  mutate(control_number = dense_rank(desc(control_number))) %>% 
  filter(control_number < 5) %>%
  plot_ly(x = ~x, y = ~y, z = ~z, color = ~power) %>% 
  add_markers(opacity = 0.1)

p
```

### Questions 

* Where are the various systems?
* What powers or factions control each system?

## Factions

The factions data set gives metadata on all of the factions involved in the 
game.

```{r}
factions <- as_tibble(fromJSON("data/factions.json"))
```

```{r}
head(factions)
```

I am unlikely to use this data set for this particular project, but it could 
provide additional information if I want to look at specific factions, 
particularly player created factions.

